NSF PAR Search | NSF Public Access Repository

Uncovering Epistatic Interactions in SARS-CoV-2 Evolution Through Hidden Markov Models

https://doi.org/10.1177/15578666261423963

Adeniyi, Ayotomiwa Ezekiel; Juyal, Akshay; Skums, Pavel; Patterson, Murray; Zelikovsky, Alex (February 2026, Journal of Computational Biology)

Understanding epistatic interactions, where mutations collectively influence viral fitness, is critical for predicting pathogen evolution. We present a hidden Markov model (HMM) framework that captures the temporal dynamics of epistatic relationships in SARS-CoV-2, addressing limitations of static network-based approaches. Our method models single amino acid variant pairs as a two-state system (linked/unlinked), with emission probabilities derived from linkage disequilibrium theory and transition probabilities optimized via the Baum–Welch algorithm. We implement permutation-based validation with temporal order noise reduction (>80% agreement across five iterations) to distinguish biological signals from stochastic noise. Applied to 2,192,008 spike protein sequences from the United States (March 2020–December 2021), our approach identified three classes of epistatic dynamics: permanent (0.3%), transient (0.3%), and oscillating (0.7%) linkages. Analysis of Alpha variant positions revealed 78% epistatic linkage compared with 1.3% across all spike protein position pairs, with 60% exhibiting oscillating patterns suggestive of frequency-dependent selection. We detected all 17 previously reported epistatic pairs plus 18 novel interactions, including critical connections between positions 69–70 and other functional sites. Notably, Alpha variant epistatic networks were detectable as early as April 2020, months before widespread circulation. Our framework scales to variant-wide analysis, revealing distinct patterns across variants: Delta (96% linkage, 71% oscillating) and Omicron (87% linkage, 56% oscillating). The computational pipeline, implemented with parallelized HMM training and Viterbi decoding, processes hundreds of thousands of position pairs efficiently. 
By transforming epistasis detection from static to temporal analysis, this work provides computational tools for early variant detection and demonstrates how probabilistic modeling can capture evolutionary dynamics in real-time genomic surveillance systems.
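As an illustration of the decoding step mentioned above, here is a minimal sketch of Viterbi decoding for a two-state (linked/unlinked) HMM. It is not the authors' pipeline. The discretized observations ("low"/"high" linkage signal per time window) and all probability values are assumptions made for the example; in the paper, emissions come from linkage disequilibrium theory and transitions are fitted with Baum–Welch.

```python
import math

STATES = ("unlinked", "linked")

# Hypothetical parameters, chosen only for illustration.
START_P = {"unlinked": 0.5, "linked": 0.5}
TRANS_P = {
    "unlinked": {"unlinked": 0.9, "linked": 0.1},
    "linked": {"unlinked": 0.1, "linked": 0.9},
}
EMIT_P = {
    "unlinked": {"low": 0.8, "high": 0.2},
    "linked": {"low": 0.2, "high": 0.8},
}


def viterbi(observations, start_p=START_P, trans_p=TRANS_P, emit_p=EMIT_P):
    """Return the most likely hidden-state path for a non-empty observation list."""
    # Log probabilities avoid numerical underflow on long time series.
    scores = {
        s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
        for s in STATES
    }
    backpointers = []
    for obs in observations[1:]:
        new_scores, pointers = {}, {}
        for s in STATES:
            # Best predecessor state for arriving in state s at this step.
            best = max(STATES, key=lambda p: scores[p] + math.log(trans_p[p][s]))
            new_scores[s] = (
                scores[best] + math.log(trans_p[best][s]) + math.log(emit_p[s][obs])
            )
            pointers[s] = best
        scores = new_scores
        backpointers.append(pointers)

    # Trace back from the best final state.
    state = max(STATES, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path
```

For example, `viterbi(["low", "low", "high", "high", "high"])` decodes a pair that starts unlinked and becomes linked from the third window onward under these assumed parameters.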
Full Text Available
Benchmarking machine learning robustness in Covid-19 genome sequence classification

https://doi.org/10.1038/s41598-023-31368-3

Ali, Sarwan; Sahoo, Bikram; Zelikovsky, Alexander; Chen, Pin-Yu; Patterson, Murray (December 2023, Scientific Reports)

The rapid spread of the COVID-19 pandemic has resulted in an unprecedented amount of sequence data of the SARS-CoV-2 genome: millions of sequences and counting. This amount of data, while being orders of magnitude beyond the capacity of traditional approaches to understanding the diversity, dynamics, and evolution of viruses, is nonetheless a rich resource for machine learning (ML) approaches as alternatives for extracting such important information from these data. It is hence of utmost importance to design a framework for testing and benchmarking the robustness of these ML models. This paper makes the first effort (to our knowledge) to benchmark the robustness of ML models by simulating biological sequences with errors. In this paper, we introduce several ways to perturb SARS-CoV-2 genome sequences to mimic the error profiles of common sequencing platforms such as Illumina and PacBio. We show from experiments on a wide array of ML models that some simulation-based approaches with different perturbation budgets are more robust (and accurate) than others for specific embedding methods to certain noise simulations on the input sequences. Our benchmarking framework may assist researchers in properly assessing different ML models and help them understand the behavior of the SARS-CoV-2 virus or avoid possible future pandemics.
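The idea of perturbing sequences under a fixed error budget can be sketched as follows. This is a deliberately simplified stand-in, not the paper's simulators: it applies uniform random substitutions only, whereas real Illumina and PacBio error profiles also involve insertions, deletions, and position-dependent rates. The function name `perturb` and its parameters are hypothetical.

```python
import random

NUCLEOTIDES = "ACGT"


def perturb(sequence, error_rate, seed=None):
    """Substitute each A/C/G/T with a different base with probability error_rate.

    Assumption: substitution-only, uniform noise. Characters outside
    NUCLEOTIDES (for example N) are left unchanged.
    """
    rng = random.Random(seed)  # local generator so results are reproducible
    out = []
    for base in sequence:
        if base in NUCLEOTIDES and rng.random() < error_rate:
            # Pick one of the three other bases uniformly at random.
            out.append(rng.choice([b for b in NUCLEOTIDES if b != base]))
        else:
            out.append(base)
    return "".join(out)
```

A robustness benchmark would then train a classifier on clean sequences and compare its accuracy on `perturb(seq, rate)` outputs for increasing values of `rate`.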
Full Text Available
Assessing the Resilience of Machine Learning Classification Algorithms on SARS-CoV-2 Genome Sequences Generated with Long-Read Specific Errors

https://doi.org/10.3390/biom13060934

Sahoo, Bikram; Ali, Sarwan; Chen, Pin-Yu; Patterson, Murray; Zelikovsky, Alexander (June 2023, Biomolecules)

The emergence of third-generation single-molecule sequencing (TGS) technology has revolutionized the generation of long reads, which are essential for genome assembly and have been widely employed in sequencing the SARS-CoV-2 virus during the COVID-19 pandemic. Although long-read sequencing has been crucial in understanding the evolution and transmission of the virus, the high error rate associated with these reads can lead to inadequate genome assembly and downstream biological interpretation. In this study, we evaluate the accuracy and robustness of machine learning (ML) models using six different embedding techniques on SARS-CoV-2 error-incorporated genome sequences. Our analysis includes two types of error-incorporated genome sequences: those generated using simulation tools to emulate error profiles of long-read sequencing platforms and those generated by introducing random errors. We show that the spaced k-mers embedding method achieves high accuracy in classifying error-free SARS-CoV-2 genome sequences, and the spaced k-mers and weighted k-mers embedding methods are highly accurate in predicting error-incorporated sequences. The fixed-length vectors generated by these methods contribute to the high accuracy achieved. Our study provides valuable insights for researchers to effectively evaluate ML models and gain a better understanding of the approach for accurate identification of critical SARS-CoV-2 genome sequences.
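To make the notion of a fixed-length embedding concrete, here is a minimal sketch of a plain k-mer count vector. It is an assumption-laden illustration, not one of the six embeddings evaluated in the study: the spaced and weighted k-mer variants named above add gaps or weights on top of this basic idea. The function name `kmer_vector` is hypothetical.

```python
from itertools import product


def kmer_vector(sequence, k=3, alphabet="ACGT"):
    """Count every k-mer over the alphabet; the result always has len(alphabet)**k entries."""
    # Fixed ordering of all possible k-mers gives each one a stable vector index.
    index = {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}
    vector = [0] * len(index)
    for start in range(len(sequence) - k + 1):
        position = index.get(sequence[start:start + k])
        if position is not None:  # skip k-mers containing ambiguous bases such as N
            vector[position] += 1
    return vector
```

Because the vector length depends only on `k` and the alphabet, genomes of different lengths map to inputs of the same dimension, which is what standard ML classifiers require.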
Full Text Available
From Alpha to Zeta: Identifying Variants and Subtypes of SARS-CoV-2 Via Clustering

https://doi.org/10.1089/cmb.2021.0302

Melnyk, Andrew; Mohebbi, Fatemeh; Knyazev, Sergey; Sahoo, Bikram; Hosseini, Roya; Skums, Pavel; Zelikovsky, Alex; Patterson, Murray (November 2021, Journal of Computational Biology)

Full Text Available
gpps: an ILP-based approach for inferring cancer progression with mutation losses from single cell data

https://doi.org/10.1186/s12859-020-03736-7

Ciccolella, Simone; Soto Gomez, Mauricio; Patterson, Murray D.; Della Vedova, Gianluca; Hajirasouliha, Iman; Bonizzoni, Paola (December 2020, BMC Bioinformatics)
Background: Cancer progression reconstruction is an important development stemming from the phylogenetics field. In this context, the reconstruction of the phylogeny representing the evolutionary history presents some peculiar aspects that depend on the technology used to obtain the data to analyze: Single Cell DNA Sequencing data have great specificity, but are affected by moderate false negative and missing value rates. Moreover, there has been some recent evidence of back mutations in cancer: this phenomenon is currently widely ignored. Results: We present a new tool, gpps, that reconstructs a tumor phylogeny from Single Cell Sequencing data, allowing each mutation to be lost at most a fixed number of times. The General Parsimony Phylogeny from Single cell (gpps) tool is open source and available at https://github.com/AlgoLab/gpps. Conclusions: gpps provides new insights into the analysis of intra-tumor heterogeneity by proposing a new progression model to the field of cancer phylogeny reconstruction on Single Cell data.
Full Text Available
Inferring Cancer Progression from Single-Cell Sequencing while Allowing Mutation Losses

https://doi.org/10.1093/bioinformatics/btaa722

Ciccolella, Simone; Ricketts, Camir; Soto Gomez, Mauricio; Patterson, Murray; Silverbush, Dana; Bonizzoni, Paola; Hajirasouliha, Iman; Della Vedova, Gianluca; Martelli, Pier Luigi (August 2020, Bioinformatics)

Motivation: In recent years, the well-known Infinite Sites Assumption (ISA) has been a fundamental feature of computational methods devised for reconstructing tumor phylogenies and inferring cancer progressions. However, recent studies leveraging Single-Cell Sequencing (SCS) techniques have shown evidence of the widespread recurrence and, especially, loss of mutations in several tumor samples. While there exist established computational methods that infer phylogenies with mutation losses, there remain some advancements to be made. Results: We present SASC (Simulated Annealing Single-Cell inference): a new and robust approach based on simulated annealing for the inference of cancer progression from SCS data sets. In particular, we introduce an extension of the model of evolution where mutations are only accumulated, by allowing also a limited amount of mutation loss in the evolutionary history of the tumor: the Dollo-k model. We demonstrate that SASC achieves high levels of accuracy when tested on both simulated and real data sets and in comparison with some other available methods. Availability: The Simulated Annealing Single-Cell inference (SASC) tool is open source and available at https://github.com/sciccolella/sasc. Supplementary information: Supplementary data are available at Bioinformatics online.
Full Text Available
